[PMON HLD] update get_reboot_cause mechanism and add get_midplane_dow… #2385

Open
chartsai-nvidia wants to merge 1 commit into sonic-net:master from chartsai-nvidia:chartsai/pmon-hld-update
@chartsai-nvidia

Why I did it

Refines the SmartSwitch PMON HLD for DPU reboot-cause and midplane-down handling:

  • The old design assumed the NPU could read a DPU's reboot cause while the DPU was dead, and it
    triggered that read only on an offline→online transition. The cause is now captured only while
    the midplane is online, using a per-boot boot_id so chassisd reliably detects a real DPU reboot.
  • Adds a get_midplane_down_reason() platform API and documents planned vs. unplanned midplane-down
    reasons.
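
The planned vs. unplanned split can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the HLD's definition: the `classify_midplane_down` helper, the `transition_in_progress` argument standing in for the transition flag, and the idea that `get_midplane_down_reason()` returns a string are all hypothetical.

```python
from typing import Callable, Tuple


def classify_midplane_down(
    transition_in_progress: bool,
    get_midplane_down_reason: Callable[[], str],
) -> Tuple[str, str]:
    """Classify an up->down midplane transition as planned or unplanned.

    Hypothetical helper: the real platform API signature and return type
    are not specified here and may differ.
    """
    if transition_in_progress:
        # A transition flag is set, so the down event was expected;
        # the platform is not queried in this sketch.
        return ("planned", "transition in progress")
    # No transition flag: ask the platform why the midplane went down.
    return ("unplanned", get_midplane_down_reason())
```

For example, `classify_midplane_down(False, lambda: "power loss")` yields `("unplanned", "power loss")`, where the lambda stands in for a vendor implementation.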

Work item tracking

  • Microsoft ADO (number only): N/A

How I did it

  • Reworked the DPU Reboot Cause flow around boot_id: the DPU publishes a fresh per-boot UUID into
    CHASSIS_STATE_DB; the NPU chassisd compares it to the last persisted value and calls
    get_reboot_cause() only on a real reboot with midplane up.
  • Added boot_id to the REBOOT_CAUSE and DPU_STATE schema examples.
  • Documented up→down midplane handling (planned via transition flag vs. unplanned via
    get_midplane_down_reason()) and added the new API definition.
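
The boot_id comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the chassisd implementation: the `RebootDetector` class, its method names, and the in-memory dict standing in for the persisted value are hypothetical, and treating a first-seen boot_id as a reboot is an assumption of this sketch.

```python
import uuid
from typing import Callable, Dict, Optional


def new_boot_id() -> str:
    """DPU side: generate a fresh per-boot UUID (hypothetical helper)."""
    return str(uuid.uuid4())


class RebootDetector:
    """NPU side: decide when to fetch a DPU reboot cause (hypothetical)."""

    def __init__(self, get_reboot_cause: Callable[[str], str]) -> None:
        self._get_reboot_cause = get_reboot_cause
        # Stand-in for the last persisted boot_id per DPU.
        self._persisted: Dict[str, str] = {}

    def on_dpu_state(
        self, dpu: str, boot_id: Optional[str], midplane_up: bool
    ) -> Optional[str]:
        """Return the reboot cause on a real reboot, otherwise None."""
        if not midplane_up or boot_id is None:
            # The cause is only captured while the midplane is online.
            return None
        if self._persisted.get(dpu) == boot_id:
            # Same boot_id as last time: the DPU did not reboot.
            return None
        cause = self._get_reboot_cause(dpu)
        # Persist only after the cause was read for this boot.
        self._persisted[dpu] = boot_id
        return cause
```

In this sketch a midplane flap with an unchanged boot_id does not trigger a new `get_reboot_cause()` call, which is the property the boot_id comparison is meant to provide.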

…n_reason

The commit updates 2 main parts:
- when to run get_reboot_cause
- get_midplane_down_reason

Signed-off-by: Charles Tsai <chartsai@nvidia.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

3 participants